Abstract

Part models of object categories are essential for challenging
recognition tasks, where differences between categories are subtle and
only reflected in the appearance of small parts of the object. We
present an approach that is able to learn part models in a completely
unsupervised manner, without part annotations and even without given
bounding boxes during learning. The key idea is to find constellations
of neural activation patterns computed using convolutional neural
networks. In our experiments, we outperform existing approaches for
fine-grained recognition on the CUB200-2011, NA birds, Oxford PETS, and
Oxford Flowers datasets in case no part or bounding box annotations are
available and achieve state-of-the-art performance for the Stanford Dog
dataset. We also show the benefits of neural constellation models as a
data augmentation technique for fine-tuning. Furthermore, our paper
unites the areas of generic and fine-grained classification, since our
approach is suitable for both scenarios.

Figure 1. Overview of our approach. Deep neural activation maps are
used to exploit the channels of a CNN as a part detector. We estimate a
part model from completely unsupervised data by selecting part
detectors that fire at similar relative locations. The created part
models are then used to extract features at object parts for
weakly-supervised classification.


1. Introduction 

Object parts play a crucial role in many recent approaches for
fine-grained recognition. They allow for capturing very localized
discriminative features of an object [18]. Part models are often
learned in a supervised manner, either from part annotations [7, 40] or
from labeled bounding boxes [15, 29].

In contrast, we show how to learn part models in a completely
unsupervised manner, which drastically reduces annotation costs for
learning. Our approach is based on learning constellations of neural
activation patterns obtained from pre-learned convolutional neural
networks (CNN). Fig. 1 shows an overview of our approach. Our part
hypotheses are outputs of an intermediate CNN layer for which we
compute neural activation maps [29, 30]. Unsupervised part models are
either built by randomly selecting a subset of the part hypotheses or
learned by estimating the parameters of a generative spatial part
model. In the latter case, we implicitly find subsets of part
hypotheses that “fire” consistently in a certain constellation in the
images.

Although models for the spatial relationship of parts were introduced a
decade ago [16, 14], these approaches face major difficulties due to
the fact that part proposals are based on hand-engineered local
descriptors and detectors without correspondence. We overcome this
problem by using implicit part detectors of a pre-learned CNN, which at
the same time greatly simplifies the part-model training. As shown by
[38], intermediate CNN outputs can often be linked to semantic parts of
common objects and we therefore use them as part proposals. Our part
model learning has to select only a few parts for each view of an
object from an already high quality pool of part proposals. This allows
for a much simpler and faster part model creation without the need to
explicitly consider the appearance of the individual parts as done in
previous works [16, 14]. At the same time, we do not need any
ground-truth part locations or bounding boxes.

The resulting approach and learning algorithm improve the state of the
art in fine-grained recognition on three datasets including CUB200-2011
[35] if no ground-truth part or bounding box annotations are available
at all. In addition, we show how to use the same approach for generic
object recognition on Caltech-256. This is a major difference to
previous work on fine-grained recognition, since most approaches are
not directly applicable to other tasks. For example, our approach is
able to achieve state-of-the-art performance on Caltech-256 without the
need for expensive dense evaluation on different scales of the
image [3 ].

Furthermore, our work has impact beyond fine-grained recognition, since
our method can also be used to guide data augmentation during
fine-tuning for image classification. We demonstrate in our experiments
that it even yields a more discriminative CNN compared to a CNN
fine-tuned with ground-truth bounding boxes of the object.

In the next section, we give a brief overview of recent approaches in
the areas of part constellation models and fine-grained classification.
Sect. 3 reviews the approach of Simon et al. [29] for part proposal
generation. In Sect. 4, we present our flexible unsupervised part
discovery method. The remainder of the paper is dedicated to the
experiments on several datasets (Sect. 5) and conclusions (Sect. 6).

2. Related work 

Part constellation models Part constellation models describe the
spatial relationship between object parts. There are many supervised
methods for part model learning which rely on ground-truth part or
bounding box annotations [41, 18, 29]. However, annotations are often
not available or expensive to obtain. In contrast, the unsupervised
setting does not require any annotation and relies on part proposals
instead. It greatly differs from the supervised setting as the
selection of useful parts is crucial. We focus on unsupervised
approaches as these are the most related to our work.

One of the early works in this area is [42], where facial landmark
detection was done by fusing single detections with a coupled ray
model. Similar to our approach, a common reference point is used and
the positions of the other parts are described by a distribution of
their relative polar coordinates. However, they rely on manually
annotated parts while we focus on the unsupervised setting. Later on,
Fergus et al. [16] and Fei-Fei et al. [14] built models based on
generic SIFT interest point detections. The model includes the relative
positions of the object parts as well as their relative scale and
appearance. While their interest point detector delivers a number of
detections without any semantics, each of the CNN-based part detectors
we use already corresponds to a specific object part proposal. This
allows us to make the part selection much more efficient and to speed
up the inference. Compared to [16, 14], the run time complexity
decreases from exponential to linear in the number of modeled parts.
Similar computational limitations occur in other works as well, for
example [27]. This is a significant benefit, especially for a large
number of part proposals.

Yang et al. [37] select object part templates from a set of randomly
initialized image patches. They build a part model based on
co-occurrence, diversity, and fitness of the templates in a set of
training images. The detected object parts are used for part-based
fine-grained classification of birds. In our application, co-occurrence
and fitness are rather weak properties for the selection of CNN-based
part proposals. For example, detectors of frequently occurring
background patterns such as leaves of a tree would likely be selected
by their algorithm. Instead, our work considers the spatial
relationship in order to filter out unrelated background detectors that
fire at inconsistent relative locations.

Crandall et al. [11] improve part model learning by jointly considering
object and scene-related parts. However, the number of combinations of
possible views of an object and different background patterns is huge.
In contrast, our approach selects the part proposals based on their
relative positions, which is simpler and effective since we only want
to identify useful part proposals for classification.

In the area of detection, there are numerous approaches based on object
parts. The deformable part model (DPM, [15]) is the most popular one.
It learns part constellation models relative to the bounding box with a
latent discriminative SVM model. Most detection methods require at
least ground-truth bounding box annotations. In contrast, our approach
does not require such annotations or any negative examples, since we
learn the constellation model in a generative manner and by using
object part proposals not restricted to a bounding box.

Fine-grained recognition with part models Fine-grained recognition
focuses on visually very similar classes, where the different object
categories sometimes differ only in minor details. Examples are the
recognition of bird species [3' ] or car models [21]. Since the
differences of small parts of the objects matter, localized feature
extraction using a part model plays an important role.

One of the earliest works in the area of fine-grained recognition uses
an ellipsoid to model the bird pose [13] and fuses the obtained parts
using very specific kernel functions [40]. Other works build on
deformable part models [15]. For example, the deformable part
descriptor method of [41] uses a supervised version of [15] for
training deformable part models, which then allows for pose
normalization by comparing corresponding parts. The works of [17] and
[18] demonstrated nonparametric part detection for fine-grained
recognition. The basic idea is to transfer human-annotated part
positions from similar training examples obtained with nearest neighbor
matching. Chai et al. [8] use the detections of DPM and the
segmentation output of GrabCut to predict part locations. Branson et
al. [7] use the part locations to warp image patches into a
pose-normalized representation. Zhang et al. [39] select object part
detections from object proposals generated by Selective Search [33].
The mentioned methods use the obtained part locations to calculate
localized features. Berg et al. [4] learn a linear classifier for each
pair of parts and classes. The decision values from numerous such
classifiers are used as feature representation. While all these
approaches work well in many tasks, they require ground-truth part
annotations at training and often also at test time. In contrast, our
approach does not rely on expensive annotated part locations and is
fully unsupervised for part model learning instead. This also follows
the recent shift of interest towards less annotation during training
[39, 36, 29]. Simon et al. [29] present a method that requires bounding
boxes of the object during training rather than part annotations. They
also make use of neural activation maps for part discovery. Although
our approach does not need bounding boxes, we are still able to improve
over their results.

The unsupervised scenario that we tackle has also been considered by
Xiao et al. [36]. They cluster the channels of the last convolutional
layers of a CNN into groups. Patches for the object and each part are
extracted based on the activation of each of these groups. The patches
are used to classify the image. While their work requires a pre-trained
classifier for the objects of interest, we only need a CNN that can be
pre-trained on a weakly related object dataset.

3. Deep neural activation maps 

CNNs have demonstrated an amazing potential to learn a complete
classification pipeline from scratch without the need to manually
define low level features. Recent CNN architectures [22, 31] consist of
multiple layers of convolutions, pooling operations, full linear
transformations and non-linear activations.

The convolutional layers convolve the input with numerous kernels. As
shown by [38], the kernels of the convolutions in early layers are
similar to the filter masks used in many popular low level feature
descriptors like HOG or SIFT. Their work also shows that the later
layers are sensitive to increasingly abstract patterns in the image.
These patterns can even correspond to whole objects [30] or parts of
objects [29] and this is exactly what we exploit.

The output $f$ of a layer before the fully-connected layers is
organized in multiple channels $1 \le p \le P$ with a two-dimensional
arrangement of output elements, i.e. we denote $f$ by
$(f^{(p)}_{j,j'}(I))$, where $I$ denotes the input image and $j$ and
$j'$ are indices of the output elements in the channel. Fig. 2 shows
examples of such a channel output for the last convolutional layer. As
can be seen, the output can be interpreted as detection scores of
multiple object part detectors. Therefore, the CNN automatically
learned implicit part detectors relevant for the dataset it was trained
on. In this case, the visualized channel shows high outputs at
locations corresponding to the head of birds and dogs.

Figure 2. Examples for the output of a channel of the last
convolutional layer and the corresponding neural activation maps for
two images (index of the channel is skipped to ease notation). A deep
red corresponds to high activation and a deep blue to no activation at
all. Activation maps are available in higher resolution and better
suited for part localization. Best viewed in color.

A disadvantage of the channel output is its resolution, which would not
allow for precise localization of parts. For this reason, we follow the
basic idea of [30] and [29] and compute deep neural activation maps. We
calculate the gradient of the average output of the channel $p$ with
respect to the input image pixels $I_{x,y}$:

    $$ m^{(p)}_{x,y}(I) = \frac{\partial}{\partial I_{x,y}} \sum_{j,j'} f^{(p)}_{j,j'}(I) , \qquad (1) $$

where the constant normalization of the average is omitted since it
does not change the location of the maximum. The calculation can be
easily achieved with a back-propagation pass [29]. The absolute value
of the gradient shows which pixels in the image have the largest impact
on the output of the channel. Similar to the actual output of the
layer, it allows for localizing image areas this channel is sensitive
to. However, the resolution of the deep neural activation maps is much
higher (Fig. 2). In our experiments, we compute part proposal locations
for a training image $I_i$ from these maps by using the point of
maximum activation:

    $$ \mu_{i,p} = \operatorname*{argmax}_{x,y} \left| m^{(p)}_{x,y}(I_i) \right| . \qquad (2) $$

Each channel of the CNN delivers one neural activation map per image
and we therefore obtain one part proposal per channel p. RGB images are
handled by adding the absolute activation maps of each input channel.
Hence we reduce a deep neural activation map to a 2D location and do
not consider image patches for each part during the part model
learning. In classification, however, image patches are extracted at
predicted part locations for feature extraction.
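
The reduction of an activation map to a single location can be illustrated with a minimal sketch. It assumes the gradient of one CNN channel with respect to the input has already been computed by back-propagation and is given as nested lists indexed as `grad_rgb[c][y][x]`; the function name and this data layout are hypothetical and chosen only for illustration.

```python
def part_proposal(grad_rgb):
    """Return the (x, y) pixel of maximum activation, or None if the map is all zero.

    grad_rgb[c][y][x] holds the gradient of one CNN channel's output with
    respect to input colour channel c at pixel (x, y). Layout is an assumption.
    """
    height, width = len(grad_rgb[0]), len(grad_rgb[0][0])
    best_value, best_xy = 0.0, None
    for y in range(height):
        for x in range(width):
            # RGB input: add the absolute gradients of the colour channels.
            value = sum(abs(channel[y][x]) for channel in grad_rgb)
            if value > best_value:
                best_value, best_xy = value, (x, y)
    # None signals an all-zero map, i.e. no location can be proposed.
    return best_xy
```

Only the 2D location is kept, so the part model learning never touches image patches.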











The implicit part detectors are learned automatically during the
training of the CNN. This is a huge benefit compared to other part
discovery approaches like poselets [ ], which do not necessarily
produce parts useful for discrimination of classes a priori. In our
case, the dataset used to train the CNN does not necessarily need to be
the same as the final dataset and task for which we want to build part
representations. In addition, determining the part proposals is nearly
as fast as the classification with the CNN (only 110 ms per image for
10 parts on a standard PC with GPU), which allows for real-time
applications. A video visualizing a bird head detector based on this
idea running at 10 fps is available at our project website. We use the
part proposals throughout the rest of this paper.

4. Unsupervised part model discovery 

In this section, we show how to construct effective part 
models in an unsupervised manner given a set of training 
images of an object class. The resulting part model is used 
for localized feature extraction and subsequent fine-grained 
classification. In contrast to most previous work, we have a 
set of robust but not necessarily related part proposals and 
need to select useful ones for the current object class. Other 
approaches like DPM are faced with learning part detectors 
instead. The main consequence is that we do not need to 
care about expensive training of robust part detectors. Our 
task simplifies to a selection of useful detectors instead. 

As input, we use the normalized part proposal locations
$\mu_{i,p} \in [0, 1]^2$ for training image $i = 1, \dots, N$ and part
proposal $p = 1, \dots, P$. The $P$ part proposals correspond to the
channels of an intermediate output layer in a CNN, and $\mu_{i,p}$ is
determined by calculating the activation map of channel $p$ for input
image $i$ and locating the maximum response. If the activation map of a
channel is equal to 0, the part proposal is considered hidden. This
sparsity naturally occurs due to the rectified linear unit used as a
nonlinear activation.

4.1. Random selection of parts 

A simple method to build a part model with multiple parts is to select
M random parts from all P proposals. For all training images, we then
extract M feature vectors describing the image region around the part
location. The features are stacked and a linear SVM is learned using
image labels. This can even be combined with fine-tuning of the CNN
used to extract the part features. Further details about part feature
representations are given in Sect. 5.
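
As a rough illustration of the random baseline, the following sketch draws M proposal indices without replacement and concatenates the corresponding per-part feature vectors. The function names, the `seed` parameter and the dictionary layout of `part_features` are assumptions made for this example; the feature extraction itself and the SVM training are not shown.

```python
import random


def select_random_parts(num_proposals, num_parts, seed=None):
    """Pick num_parts distinct proposal indices out of num_proposals."""
    rng = random.Random(seed)
    return sorted(rng.sample(range(num_proposals), num_parts))


def stack_part_features(part_features, selected):
    """Concatenate the feature vectors of the selected parts into one vector.

    part_features maps a proposal index to its feature vector (a list).
    """
    stacked = []
    for p in selected:
        stacked.extend(part_features[p])
    return stacked
```

The stacked vector grows linearly with M, which is why using all P proposals quickly becomes impractical.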

In our experiments, we show that for generic object recognition random
selection is indeed a valid technique. However, for fine-grained
recognition, we need to select the parts that likely correspond to the
same object and not a background artifact. Furthermore, using all
proposals is not an option since the feature representation increases
dramatically, rendering training impractical. Therefore, we show in the
following how to select only a few parts with a constellation model to
boost classification performance and reduce computation time for
feature calculation significantly.

4.2. Constellations of neural activations 

The goal is to estimate a star-shaped model for a subset of selected
proposals using the 2D locations of all part proposals of all training
images. Similar to other popular part models like DPM [15], our model
also incorporates multiple views $v = 1, \dots, V$ of the object of
interest. For example, the front and the side view of a car are
different and different parts are required to describe each view.

Each view consists of a selection of M part proposals denoted by the
indicator variables $b_{v,p} \in \{0, 1\}$ and we refer to them as
parts. In addition, there is a set of corresponding shift vectors
$d_{v,p} \in [-1, 1]^2$. The shift vectors are the ideal relative
offset of part $p$ to the common root location $a_i$ of the object in
image $i$. The $a_i$ are latent variables since no object annotations
are given during learning.

Another set of latent variables $s_{i,v} \in \{0, 1\}$ denotes the view
selection for each training image. We assume that there is only one
target object visible in each image and hence only one view is selected
for each image. Finally, $h_{i,p} \in \{0, 1\}$ denotes if part $p$ is
visible in image $i$. In our case, the visibility of a part is provided
by the part proposals and not estimated during learning.

Learning objective We identify the best model for the given training
images by maximum a-posteriori estimation of all model and latent
parameters $\Theta = (b, d, s, a)$ from provided part proposal
locations $\mu$:

    $$ \hat{\Theta} = \operatorname*{argmax}_{\Theta} \; p(\Theta \mid \mu) . \qquad (3) $$

By maximizing over the latent variables instead of marginalizing them,
we obtain a very efficient learning algorithm. We apply Bayes' rule,
use the typical assumption that training images and part proposals are
independent given the model parameters [1], assume flat priors for $a$
(no prior preference for the object's center) and $d$ (no prior
preference for part offsets), and independent priors for $b$ and $s$:

    $$ \hat{\Theta} = \operatorname*{argmax}_{\Theta} \; p(\mu \mid b, d, s, a) \cdot p(b) \cdot p(s)
       = \operatorname*{argmax}_{\Theta} \left( \prod_{i=1}^{N} \prod_{p=1}^{P} p(\mu_{i,p} \mid b, d, s, a) \right) p(b) \cdot p(s) . \qquad (4) $$

The term $p(\mu_{i,p} \mid b, d, s, a)$ is the distribution of the
predicted part locations given the model. If the part $p$ is used in
view $v$ of image $i$, we assume that the part location is normally
distributed around the root location plus the shift vector, i.e.
$\mu_{i,p} \sim \mathcal{N}(d_{v,p} + a_i, \sigma_{v,p}^2 E)$ with $E$
denoting the identity matrix. If the part is not used, there is no
prior information about the location and we assume it to be uniformly
distributed over all possible image locations in $I_i$. Hence, the
distribution is given by

    $$ p(\mu_{i,p} \mid b, d, s, a) = \prod_{v=1}^{V} \mathcal{N}(\mu_{i,p} \mid a_i + d_{v,p}, \sigma_{v,p}^2 E)^{t_{i,v,p}} \left( \frac{1}{|I_i|} \right)^{1 - t_{i,v,p}} , \qquad (5) $$

where $t_{i,v,p} = s_{i,v} b_{v,p} h_{i,p} \in \{0, 1\}$ indicates
whether part $p$ is used and visible in view $v$, which is itself
active in image $i$. The prior distribution for the part selection $b$
only captures the constraint that M parts need to be selected, i.e.
$\forall v : M = \sum_{p=1}^{P} b_{v,p}$. The prior for the view
selection $s$ incorporates our assumption that only a single view is
active in training image $i$, i.e.
$\forall i : 1 = \sum_{v=1}^{V} s_{i,v}$. In general, we denote the
feasible set of variables as $\mathcal{M}$. Exploiting this and
applying log simplifies Eq. (4) further:

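    $$ \operatorname*{argmin}_{\Theta \in \mathcal{M}} \; - \sum_{i=1}^{N} \sum_{p=1}^{P} \sum_{v=1}^{V} t_{i,v,p} \log \mathcal{N}(\mu_{i,p} \mid a_i + d_{v,p}, \sigma_{v,p}^2 E) $$

In addition, we assume the variance $\sigma_{v,p}^2$ to be constant for
all parts of all views. Hence, the final formulation of the
optimization problem becomes

    $$ \operatorname*{argmin}_{\Theta \in \mathcal{M}} \; \sum_{i=1}^{N} \sum_{p=1}^{P} \sum_{v=1}^{V} s_{i,v} b_{v,p} h_{i,p} \, \lVert \mu_{i,p} - a_i - d_{v,p} \rVert^2 . \qquad (6) $$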
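
The link between the Gaussian log-likelihood and the squared distance in Eq. (6) can be made explicit. The following is a short sketch for a single term, assuming an isotropic Gaussian in two dimensions with a shared variance $\sigma^2$ (the shared variance is the assumption stated in the text; the normalization term is the standard one for a 2D Gaussian):

```latex
\begin{align*}
\mathcal{N}(\mu_{i,p} \mid a_i + d_{v,p}, \sigma^2 E)
  &= \frac{1}{2\pi\sigma^2}
     \exp\!\left( -\frac{\lVert \mu_{i,p} - a_i - d_{v,p} \rVert^2}{2\sigma^2} \right), \\
-\log \mathcal{N}(\mu_{i,p} \mid a_i + d_{v,p}, \sigma^2 E)
  &= \frac{1}{2\sigma^2} \lVert \mu_{i,p} - a_i - d_{v,p} \rVert^2
     + \log\!\left( 2\pi\sigma^2 \right).
\end{align*}
```

With $\sigma^2$ shared by all parts and views, the factor $1/(2\sigma^2)$ does not change the minimizer, and Eq. (6) retains the squared-distance term weighted by $t_{i,v,p} = s_{i,v} b_{v,p} h_{i,p}$.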

Optimization Eq. (6) is solved by alternately optimizing each of the
model variables $b$ and $d$, as well as the latent variables $a$ and
$s$, independently, similar to the standard EM algorithm. For each of
the variables $b$ and $s$, we can calculate the optimal value by
sorting error terms. For example, $b_{v,p}$ is calculated by analyzing

    $$ \operatorname*{argmin}_{b \in \mathcal{M}} \; \sum_{p=1}^{P} \sum_{v=1}^{V} b_{v,p} \underbrace{\left( \sum_{i=1}^{N} s_{i,v} h_{i,p} \, \lVert \mu_{i,p} - a_i - d_{v,p} \rVert^2 \right)}_{E(v,p)} . \qquad (7) $$

This optimization can be intuitively solved. First, each view is
considered independently, as we select a fixed number of parts for each
view without considering the others. For each part proposal, we
calculate $E(v,p)$. This term describes how well the part proposal $p$
fits to the view $v$. If its value is small, then the part proposal
fits well to the view and should be selected. We now calculate $E(v,p)$
for all parts of view $v$ and select the M parts with the smallest
value. In a similar manner, the view selection $s$ can be determined.
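
The part selection step can be sketched as a sort over precomputed error terms. The example below assumes the errors E(v,p) are already accumulated in a nested list `E[v][p]`; the function name and this layout are hypothetical.

```python
def select_parts(E, num_parts):
    """Return b[v][p] in {0, 1} with the num_parts smallest errors per view set to 1.

    E[v][p] is the accumulated squared error of proposal p in view v.
    Each view is handled independently of the others.
    """
    b = []
    for errors in E:
        # Rank proposals of this view by increasing error.
        ranked = sorted(range(len(errors)), key=lambda p: errors[p])
        chosen = set(ranked[:num_parts])
        b.append([1 if p in chosen else 0 for p in range(len(errors))])
    return b
```

For example, `select_parts([[0.5, 0.1, 0.9, 0.2]], 2)` marks the proposals with errors 0.1 and 0.2 as selected.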

The root points $a$ are obtained for fixed $b$, $s$, and $d$ by

    $$ \hat{a}_i = \sum_{v,p} t_{i,v,p} \left( \mu_{i,p} - d_{v,p} \right) \Big/ \Big( \sum_{v',p'} t_{i,v',p'} \Big) . \qquad (8) $$

Similarly, we obtain the shift vectors $d_{v,p}$:

    $$ \hat{d}_{v,p} = \sum_{i=1}^{N} t_{i,v,p} \left( \mu_{i,p} - a_i \right) \Big/ \Big( \sum_{i'=1}^{N} t_{i',v,p} \Big) . \qquad (9) $$

The formulas are intuitive as, for example, the shift vectors $d_{v,p}$
are assigned the mean offset between root point and predicted part
location $\mu_{i,p}$. The mean, however, is only calculated for images
in which part $p$ is used.
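
Both updates are plain means over the active terms, which the following sketch spells out. It assumes locations are 2-tuples and the indicators are given as `t[i][v][p]`; the function names, this layout, and the `fallback` root used when an image has no active part are assumptions of the example and not part of the described method.

```python
def update_shifts(mu, a, t):
    """Eq. (9): d[v][p] is the mean of mu[i][p] - a[i] over images with t[i][v][p] == 1."""
    num_images, num_views, num_props = len(t), len(t[0]), len(t[0][0])
    d = [[(0.0, 0.0)] * num_props for _ in range(num_views)]
    for v in range(num_views):
        for p in range(num_props):
            used = [i for i in range(num_images) if t[i][v][p]]
            if used:
                dx = sum(mu[i][p][0] - a[i][0] for i in used) / len(used)
                dy = sum(mu[i][p][1] - a[i][1] for i in used) / len(used)
                d[v][p] = (dx, dy)
    return d


def update_roots(mu, d, t, fallback=(0.5, 0.5)):
    """Eq. (8): a[i] is the mean of mu[i][p] - d[v][p] over all active (v, p) in image i."""
    roots = []
    for i in range(len(t)):
        terms = [(mu[i][p][0] - d[v][p][0], mu[i][p][1] - d[v][p][1])
                 for v in range(len(t[i]))
                 for p in range(len(t[i][v]))
                 if t[i][v][p]]
        if terms:
            roots.append((sum(x for x, _ in terms) / len(terms),
                          sum(y for _, y in terms) / len(terms)))
        else:
            # Assumption of this sketch: keep the image centre if nothing is active.
            roots.append(fallback)
    return roots
```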

This kind of optimization is comparable to the EM algorithm and thus
shares the same challenges. In particular, the initialization of the
variables is crucial. We initialize $a$ to be the center of the image
and $s$ as well as $b$ randomly to an assignment of views and selection
of parts for each view, respectively. The initialization of $d$ is
avoided by calculating it first. The value of $b$ is used to determine
convergence. This optimization is repeated with different
initializations and the result with the best objective value is used.
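
Choosing among restarts requires evaluating the objective of Eq. (6). A minimal sketch, assuming 2-tuples for locations and nested lists for all indicators; `fit_once` is a hypothetical callback that runs one randomly initialized optimization and returns a pair `(objective_value, model)`.

```python
import random


def objective(mu, a, d, s, b, h):
    """Eq. (6): sum of squared distances over all active, selected, visible parts."""
    total = 0.0
    for i in range(len(mu)):
        for v in range(len(d)):
            if not s[i][v]:
                continue
            for p in range(len(d[v])):
                if b[v][p] and h[i][p]:
                    dx = mu[i][p][0] - a[i][0] - d[v][p][0]
                    dy = mu[i][p][1] - a[i][1] - d[v][p][1]
                    total += dx * dx + dy * dy
    return total


def best_of_restarts(fit_once, num_restarts, seed=0):
    """Run fit_once(rng) several times and keep the run with the smallest objective."""
    rng = random.Random(seed)
    results = [fit_once(rng) for _ in range(num_restarts)]
    return min(results, key=lambda result: result[0])
```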
Inference The inference step for an unseen test image is similar to
the calculations during training. The parameters $s$ and $a$ are
iteratively estimated by solving the analogue of Eq. (7) for $s$ and
Eq. (8) for fixed learned model parameters $b$ and $d$. The visibility
is again provided directly by the neural activation maps.
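
For a single test image, the alternation between view selection and root estimation might look as follows. This is only a sketch: the fixed iteration count, the start at the image centre, and the rule that a view without any visible selected part is never chosen are assumptions of the example, as are all names.

```python
import math


def _visible_parts(h, b_view):
    """Indices of proposals that are selected in this view and visible in the image."""
    return [p for p in range(len(h)) if b_view[p] and h[p]]


def _view_error(mu, h, b_view, d_view, root):
    used = _visible_parts(h, b_view)
    if not used:
        return math.inf  # assumption: skip views with no visible selected part
    return sum((mu[p][0] - root[0] - d_view[p][0]) ** 2
               + (mu[p][1] - root[1] - d_view[p][1]) ** 2
               for p in used)


def infer_view_and_root(mu, h, b, d, num_iters=10):
    """Alternate between picking the best view and re-estimating the root location."""
    root = (0.5, 0.5)  # image centre in normalized coordinates
    view = 0
    for _ in range(num_iters):
        view = min(range(len(b)),
                   key=lambda v: _view_error(mu, h, b[v], d[v], root))
        used = _visible_parts(h, b[view])
        if used:
            root = (sum(mu[p][0] - d[view][p][0] for p in used) / len(used),
                    sum(mu[p][1] - d[view][p][1] for p in used) / len(used))
    return view, root
```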

5. Experiments 

The experiments cover three main aspects and applications of our
approach. First, we present a data augmentation technique based on the
part models of our approach for fine-tuning, which outperforms
fine-tuning on bounding boxes. Second, we apply our approach to
fine-grained classification, a task in which most current approaches
rely on ground-truth part annotations [7, 39, 29]. Finally, we show how
to use the same approach for generic image classification as well and
present the benefits in this area. Code for our method will be made
available.

5.1. Experimental setup 

Datasets We use five different datasets in the experiments. For
fine-grained classification, we evaluate our approach on CUB200-2011
[35] (200 classes, 11788 images), NA birds [34] (555 classes, 48562
images), Stanford dogs [20] (120 classes, 20580 images), Oxford flowers
102 [24] (102 classes, 8189 images), and Oxford-IIIT Pets [25] (37
classes, 7349 images). We use the provided split into training and test
and follow the evaluation protocol of the corresponding papers. Hence
we report the overall accuracy on CUB200-2011 and the mean class-wise
accuracy on all other datasets. For the task of generic object
recognition, we evaluate on Caltech 256 [19], which contains 30607
images of a diverse set of 256 common objects. We follow the evaluation
protocol of [31] and randomly select 60 training images and use the
rest for testing.
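
The two reported measures differ when classes are imbalanced. A minimal sketch of both, with hypothetical function names and labels given as plain lists:

```python
from collections import defaultdict


def overall_accuracy(y_true, y_pred):
    """Fraction of all test images that are classified correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def mean_classwise_accuracy(y_true, y_pred):
    """Average of the per-class accuracies, so every class counts equally."""
    correct, total = defaultdict(int), defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        correct[t] += int(t == p)
    return sum(correct[c] / total[c] for c in total) / len(total)
```

With three images of class 0 and one of class 1, predicting class 0 everywhere gives an overall accuracy of 0.75 but a mean class-wise accuracy of 0.5.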

CNNs and parameters Two different CNN architectures were used in our
experiments: the widely used architecture of Krizhevsky et al. [22]
(AlexNet) and the more accurate one of Simonyan et al. [31] (VGG19). In
case of NA birds, we use GoogLeNet [31 ]. For details about the
architecture, we kindly refer the reader to the corresponding papers.
It is important to note that our approach can be used with any CNN.
Features were calculated using the relu6, relu7 and pool5/7x7_s1 layer,
respectively. For the localization of parts, the pool5 layer was used.
This layer consists of 256 channels for AlexNet and 512 channels for
VGG19, resulting in 256 and 512 part proposals, respectively. In case
of the CUB200-2011, NA birds, Stanford dogs, Oxford pets and Oxford
flowers datasets, fine-tuning with our proposed data augmentation
technique is used. We use two-step fine-tuning [7] starting with a
learning rate of 0.001 and decrease it to 0.0001 when there is no
change in the loss anymore. In case of Stanford dogs, the evaluation
with CNNs pre-trained on ILSVRC 2012 images is biased as the complete
dataset is a subset of the ILSVRC 2012 training image set. Hence, we
removed the testing images of Stanford dogs from the training set of
ILSVRC 2012 and learned a CNN from scratch on this modified dataset.
The trained model is available on our website for easy comparison with
this work.

If not mentioned otherwise, the learned part models use 5 views and 10
parts per view. A model is learned for each class separately. The part
model learning is repeated 5 times and the model with the best
objective value is taken. We count in how many images each part is used
and select the 10 most often selected parts for use in classification.
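
The final counting step can be sketched with a counter over the usage indicators. The layout `t[i][v][p]` (1 if part p is used in the active view v of image i) and the function name are assumptions of this example.

```python
from collections import Counter


def most_used_parts(t, num_keep):
    """Return the num_keep proposal indices that are used in the most images."""
    counts = Counter()
    for image in t:  # image[v][p] in {0, 1}
        for view in image:
            for p, used in enumerate(view):
                if used:
                    counts[p] += 1
    return [p for p, _ in counts.most_common(num_keep)]
```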
Classification framework We use the part-based classification approach
presented by Simon et al. [29]. Given the predicted localization of all
selected parts, we crop square boxes centered at each part and
calculate features for all of them. The size of these boxes is given by
$\sqrt{\lambda \cdot W \cdot H}$, $\lambda \in$ {| , jq }, where $W$
and $H$ are the width and height of the uncropped image, respectively.
If a part is not visible, the features calculated on a mean image are
used instead. This kind of imputation has comparable performance to
zero imputation, but yields a slight performance gain in some cases. In
case of CUB200-2011, we also estimate a bounding box for each image.
Selective Search [33] is applied to each image to generate bounding box
proposals. Each proposal is classified by the CNN and the proposal with
the highest classification confidence is used as estimated bounding
box.
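
The crop geometry can be illustrated with a small sketch. The box side follows the square-root rule from the text; clipping the box to the image borders is an assumption of this example, and the value passed for `lam` below is purely hypothetical.

```python
import math


def part_box(cx, cy, width, height, lam):
    """Square box centred at (cx, cy) with side sqrt(lam * width * height).

    Returns (left, top, right, bottom), clipped to the image (an assumption).
    """
    half = math.sqrt(lam * width * height) / 2.0
    left = max(0.0, cx - half)
    top = max(0.0, cy - half)
    right = min(float(width), cx + half)
    bottom = min(float(height), cy + half)
    return (left, top, right, bottom)
```

For a 100 x 100 image and the hypothetical value `lam = 0.25`, the side length is 50 pixels, so `part_box(50, 50, 100, 100, 0.25)` covers the central quarter of the image.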

The features of each part, the uncropped image and the estimated
bounding box are stacked and classified using a linear SVM. In case of
CUB200-2011, flipped training images were used as well. Hyperparameters
were optimized using cross-validation on the training data of
CUB200-2011 and used for the other datasets as well.

5.2. Data augmentation using part proposals 

Fine-tuning is the adaptation of a pre-learned CNN to a domain specific
dataset. It significantly boosts the performance in many tasks [3].
Since the domain specific datasets are often small and thus the
training of a CNN is prone to overfitting, the training set is
artificially enlarged by using “data augmentation”. A common technique
used for example by [22, 31] is random cropping of a large fixed sized
image patch. This is especially effective if the training images are
cropped to the object of interest. If the images are not cropped and no
ground-truth bounding box is available, uncropped images can be used
instead. However, fine-tuning is less effective as shown in Tab. 1.
Since ground-truth bounding box annotations are often not available or
expensive to obtain, we propose to fine-tune on object proposals
filtered by a novel selection scheme instead.

Figure 3. Overview of our approach to filter object proposals for
fine-tuning of CNNs. Best viewed in color.

Train. Anno.  Method                                  Accuracy
Bbox          Fine-tuning on cropped images           67.24%
None          No fine-tuning                          63.77%
None          Fine-tuning on uncropped images         66.10%
None          Fine-tuning on filtered part proposals  67.97%

Table 1. Influence of the augmentation technique used for fine-tuning
in case of AlexNet on CUB200-2011. Classification accuracies were
obtained by using 8 parts as described in Sect. 5.3.

An overview of our approach is shown in Fig. 3.

First, we select for each training image the five parts of the
corresponding view, which fit the model best. Second, numerous object
proposals are generated using Selective Search [33]. These proposals
are very noisy, i.e. many only contain background and not the object of
interest. We count how many of the predicted parts are inside of each
proposal and select only proposals containing at least three parts. The
remaining patches, ≈ 48 on average in case of CUB200-2011, are high
quality image regions containing the object of interest. Finally,
fine-tuning is performed using the filtered proposals of all training
images.
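
The filtering rule can be sketched in a few lines. Boxes are assumed to be `(left, top, right, bottom)` tuples and part locations `(x, y)` tuples in the same pixel coordinates; the function name and the inclusive border test are choices of this example.

```python
def filter_proposals(boxes, part_points, min_parts=3):
    """Keep the object proposals that contain at least min_parts predicted parts."""
    kept = []
    for left, top, right, bottom in boxes:
        inside = sum(left <= x <= right and top <= y <= bottom
                     for x, y in part_points)
        if inside >= min_parts:
            kept.append((left, top, right, bottom))
    return kept
```

With five predicted parts per image and `min_parts=3`, proposals that cover only background or a small corner of the object are discarded.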

The result of this approach is shown in Tab. 1. Fine-tuning on these
patches not only provides a gain even compared to fine-tuning on
cropped images, it also eliminates the need for ground-truth bounding
box annotations.











Train. Anno.  Test Anno.  Method                            Accuracy
Parts         Bbox        Bbox CNN features                 56.00%
Parts         Bbox        Berg et al. [4]                   56.78%
Parts         Bbox        Goering et al. [18]               57.84%
Parts         Bbox        Chai et al. [ 1]                  59.40%
Parts         Bbox        Simon et al. [29]                 62.53%
Parts         Bbox        Donahue et al. [ 2]               64.96%
Parts         None        Simon et al. [29]                 60.55%
Parts         None        Zhang et al. [39]                 73.50%
Parts         None        Branson et al. [7]                75.70%
Bbox          None        Simon et al. [29]                 53.75%
None          None        Xiao et al. [36] (AlexNet)        69.70%
None          None        Xiao et al. [36] (VGG19)          77.90%
None          None        No parts (AlexNet)                52.20%
None          None        Ours, rand., Sect. 4.1 (AlexNet)  60.30 ± 0.74%
None          None        Ours, const., Sect. 4.2 (AlexNet) 68.50%
None          None        No parts (VGG19)                  71.94%
None          None        Ours, rand., Sect. 4.1 (VGG19)    79.44 ± 0.56%
None          None        Ours, const., Sect. 4.2 (VGG19)   81.01%

Table 2. Species categorization performance on CUB200-2011.


5.3. Fine-grained recognition without annotations 

Most approaches in the area of fine-grained recognition rely on
additional annotation like ground-truth part locations or bounding
boxes. Recent works distinguish between several settings based on the
amount of annotations required. The approaches either use part
annotations, only bounding box annotations, or no annotation at all. In
addition, the required annotation in training is distinguished from the
annotation required at test time. Our approach only uses the class
labels of the training images without additional annotation.

CUB200-2011 The results of fine-grained recognition on
CUB200-2011 are shown in Tab. 2. We present three different results
for every CNN architecture. “No parts” corresponds to global image
features only. “Ours, rand.” and “Ours, const.” are the approaches
presented in Sect. 4.1 and 4.2. As can be seen in the table, our
approach improves on the work of Xiao et al. [36] by 3.1%, a
relative error decrease of about 14% (from 22.1% to 19.0% error).
It is important to note that their work requires a pre-trained
classifier for birds in order to select useful patches for
fine-tuning. In addition, the authors confirmed that they used a
much larger bird subset of ImageNet for pre-training of their CNN.
In contrast, our work is easier to adapt to other datasets, as we
only require a generic pre-trained CNN and no domain-specific
outside training data. The gap between our approach and the third
best result in this setting by Simon et al. [29] is even larger,
with more than 27% difference. The table also shows results for the
use of no parts and random part selection. As can be seen, even
random part selection improves the accuracy by 8%
on average compared to the use of no parts. The presented part
selection scheme boosts the performance even further to 68.5% using
AlexNet and 81.01% using VGG19.

Train. Anno.  Test Anno.  Method                                Accuracy
------------  ----------  ------------------------------------  --------
Parts         Parts       Horn et al. [34]                      75.0%
None          None        No parts (GoogLeNet)                  63.9%
None          None        Ours, const., Sect. 4.2 (GoogLeNet)   76.3%

Table 3. Species categorization performance on NA Birds.

Method                              Accuracy
----------------------------------  --------------
Chai et al. [8]                     45.60%
Gavves et al. [1 ]                  50.10%
Chen et al. [ ]                     52.00%
GoogLeNet ft [28]                   75.00%
No parts (AlexNet)                  55.90%
Ours, rand., Sect. 4.1 (AlexNet)    63.29 ± 0.97%
Ours, const., Sect. 4.2 (AlexNet)   68.61%

Table 4. Species categorization performance on Stanford dogs.
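
Comparisons such as the one against Xiao et al. on CUB200-2011 can be restated as a relative error reduction. The sketch below takes the two VGG19 accuracies from Tab. 2 and assumes the common convention of dividing by the baseline's error. Dividing by the new error instead would give a larger figure. The function name is hypothetical.

```python
def relative_error_reduction(baseline_acc, new_acc):
    """Fraction of the baseline's error that is removed (accuracies in percent)."""
    baseline_err = 100.0 - baseline_acc
    new_err = 100.0 - new_acc
    return (baseline_err - new_err) / baseline_err


# Accuracies taken from Tab. 2 (VGG19, no annotations in training or test).
xiao_vgg19 = 77.90
ours_vgg19 = 81.01

gain = ours_vgg19 - xiao_vgg19
reduction = relative_error_reduction(xiao_vgg19, ours_vgg19)
print(f"absolute gain: {gain:.2f} points")
print(f"relative error reduction: {100 * reduction:.1f}%")
```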

NA birds The results of our approach on the relatively new NA
birds dataset are shown in Tab. 3. The accuracy without using any
parts is only 63.9%. Similar to the CUB200-2011 dataset, there is a
clear advantage of using parts selected by our approach, with an
accuracy of 76.3%. Interestingly, the accuracy is very close to the
one on CUB200, although NA birds has more than 2.5 times as many
classes. We outperform the baseline provided by the authors using
the approach of [7], even though we are not using any kind of part
annotation.

Stanford dogs The accuracy on Stanford dogs is given in Tab. 4.
To the best of our knowledge, there is only one work showing
results for a CNN trained from scratch excluding the testing images
of Stanford dogs. Sermanet et al. [28] fine-tuned their very deep
GoogLeNet architecture to obtain 75% accuracy. In our experiments,
we used the much weaker architecture of Krizhevsky et al. and still
reached 68.61%. Compared to the other non-deep architectures, this
means an improvement of more than 16%.

Oxford pets and flowers The results for the Oxford flowers and
pets datasets are shown in Tab. 5 and 6. Our approach consistently
outperforms previous work by a large margin on both datasets.
Similar to the other datasets, randomly selected parts already
improve the accuracy by up to 4%. Our approach significantly
improves this even further and achieves 95.35% and 91.60%,
respectively.

Influence of the number of parts Fig. 7 provides insight into the
influence of the number of parts used in classification. We compare
random part selection to the selection based on the part
constellation model. In contrast to the previous experiments, one
patch is extracted per part using A = jq. While random parts
increase the accuracy for any number of parts, the presented scheme
clearly selects more relevant parts and helps to greatly improve
the accuracy.

Method                              Accuracy
----------------------------------  --------------
Angelova et al. [2]                 80.66%
Murray et al. [23]                  84.60%
Razavian et al. [26]                86.80%
Azizpour et al. [3]                 91.30%
No parts (AlexNet)                  90.35%
Ours, rand., Sect. 4.1 (AlexNet)    90.32 ± 0.18%
Ours, const., Sect. 4.2 (AlexNet)   91.74%
No parts (VGG19)                    93.07%
Ours, rand., Sect. 4.1 (VGG19)      94.20 ± 0.23%
Ours, const., Sect. 4.2 (VGG19)     95.34%

Table 5. Classification performance on Oxford 102 flowers.

Method                              Accuracy
----------------------------------  --------------
Bo et al. [5]                       53.40%
Angelova et al. [2]                 54.30%
Murray et al. [ 3]                  56.80%
Azizpour et al. [3]                 88.10%
No parts (AlexNet)                  78.55%
Ours, rand., Sect. 4.1 (AlexNet)    82.70 ± 1.64%
Ours, const., Sect. 4.2 (AlexNet)   85.20%
No parts (VGG19)                    88.76%
Ours, rand., Sect. 4.1 (VGG19)      90.42 ± 0.94%
Ours, const., Sect. 4.2 (VGG19)     91.60%

Table 6. Species categorization performance on Oxford-IIIT Pets.

Table 7. Influence of the number of parts on the accuracy on
CUB200-2011. One patch was extracted for each part proposal.

Method                              Accuracy
----------------------------------  --------
Zeiler et al. [38]                  74.20%
Chatfield et al. [9]                78.82%
Simonyan et al. [3 ] + VGG19        85.10%
No parts (AlexNet)                  71.44%
Ours, rand., Sect. 4.1 (AlexNet)    72.39%
Ours, const., Sect. 4.2 (AlexNet)   72.57%
No parts (VGG19)                    82.44%
Ours, const., Sect. 4.2 (VGG19)     84.10%

Table 8. Accuracy on the Caltech 256 dataset with 60 training
images per category.

5.4. From fine-grained to generic classification 

Almost all current approaches in fine-grained recognition are
specialized algorithms, and it is hardly possible to apply them to
generic classification tasks. The main reason is the common
assumption in fine-grained recognition that there are shared
semantic parts for all objects. Does that mean that all the rich
knowledge in the area of fine-grained recognition will never be
useful for other areas? Are fine-grained and generic classification
so different? In our opinion, the answer is a clear no, and the
proposed approach is a good example of that.

There are two main challenges for applying fine-grained
classification approaches to other tasks. First, the semantic part
detectors need to be replaced by more abstract interest point
detectors. Second, the selection or training of useful interest
point detectors needs to consider that each object class has its
own unique shape and set of semantic parts. Our approach can be
applied to generic classification tasks in a natural way. The first
challenge is already solved by using the part detectors of a CNN
trained to distinguish a huge number of classes. Because the CNN is
trained on so many classes, its part proposals can be seen as
generic interest point detectors with a focus on a special pattern.
In contrast to semantic parts, they do not necessarily recognize
only a specific part of a specific object. Instead, they capture
interesting points of many different kinds of objects. The second
challenge is tackled by building class-wise part models and
selecting part proposals that are shared among most classes.
However, even a random selection of part detectors already turns
out to increase the classification accuracy.
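
One simple way to picture the second step, selecting part proposals shared among most classes, is a vote over the class-wise models. The sketch below is only an illustration under assumptions: each class-wise model is reduced to a set of selected CNN channel indices, and "most classes" is read as more than a fraction `min_fraction` of them. The function name, the threshold, and the example data are all hypothetical.

```python
from collections import Counter


def shared_part_proposals(class_models, min_fraction=0.5):
    """Return channel ids selected by more than `min_fraction` of the classes.

    class_models maps a class name to the channel ids chosen by that
    class's part model (an assumed, simplified representation).
    """
    counts = Counter()
    for channels in class_models.values():
        counts.update(set(channels))  # each class votes once per channel
    threshold = min_fraction * len(class_models)
    return sorted(ch for ch, votes in counts.items() if votes > threshold)


# Hypothetical class-wise selections for three generic object classes.
class_models = {
    "dog": [3, 7, 12],
    "car": [7, 12, 40],
    "chair": [7, 99],
}
shared = shared_part_proposals(class_models)
print(shared)
```

Channels that only one class finds useful are discarded, so the final feature representation has the same layout for every class.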

Caltech 256 The results of our approach on Caltech 256 are shown
in Tab. 8. The proposed method improves the baseline of global
features without oversampling by 1% in the case of AlexNet and 1.6%
in the case of VGG19. While Simonyan et al. achieve slightly higher
performance, their approach is also much more expensive due to
dense evaluation of the whole CNN over all possible crops at three
different scales. Their best result of 86.2% is achieved by using a
fusion of two CNN models, which is not done in our case and is
consequently not comparable. The results clearly show that
replacing semantic part detectors by more generic detectors can be
enough to apply fine-grained classification approaches in other
areas. Many current approaches in generic image classification rely
on “blind” parts. For example, spatial pyramids or other
oversampling methods are equivalent to part detectors that always
detect something at a fixed position in the image. Replacing these
“blind” detections by more sophisticated ones in combination with
class-wise part models is a natural improvement.
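
The contrast between “blind” and data-driven detections can be made concrete with a small sketch. It is illustrative only: the five fixed positions mimic a corner-plus-center oversampling scheme, and the response map is a stand-in for whatever a part detector outputs for one image. Neither function is taken from the paper.

```python
def blind_locations(height, width):
    """Fixed (row, col) positions that ignore the image content."""
    return [
        (height // 4, width // 4),
        (height // 4, 3 * width // 4),
        (3 * height // 4, width // 4),
        (3 * height // 4, 3 * width // 4),
        (height // 2, width // 2),
    ]


def detected_location(response_map):
    """(row, col) of the strongest response in a 2-D map given as nested lists."""
    best = None
    for r, row in enumerate(response_map):
        for c, value in enumerate(row):
            if best is None or value > best[0]:
                best = (value, r, c)
    if best is None:
        raise ValueError("response_map must not be empty")
    return best[1], best[2]


# Hypothetical detector response for one image.
response_map = [
    [0.1, 0.2, 0.0],
    [0.0, 0.9, 0.3],
]
print(blind_locations(8, 8))
print(detected_location(response_map))
```

The first function returns the same positions for every image, while the second moves with the image content. This is the difference between oversampling and a part detector in the sense used here.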




















6. Conclusions 

This paper presents an unsupervised approach for the selection of
generic parts for fine-grained and generic image classification.
Given a CNN pre-trained for classification, we exploit the learned
inherent part detectors for generic part detection. A part
constellation model is estimated by analyzing the predicted part
locations for all training images. The resulting model contains a
selection of useful part proposals as well as their spatial
relationship in different views of the object of interest.
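
As a rough intuition for how analyzing predicted part locations can lead to a selection of part proposals, the sketch below ranks part proposals by how consistently they fire at the same position relative to the other parts. It is a strong simplification and not the constellation model of this paper: it uses a single view, takes the mean of all part locations in an image as the reference point, and scores each part by the variance of its offset. All names and the example data are hypothetical.

```python
from statistics import pvariance


def select_consistent_parts(part_locations, num_parts):
    """Pick the `num_parts` part proposals with the most stable relative position.

    part_locations[i][p] is the predicted (x, y) of part proposal p in image i.
    """
    n_parts = len(part_locations[0])
    offsets = [[] for _ in range(n_parts)]
    for image_parts in part_locations:
        # Reference point of this image: mean of all predicted part locations.
        cx = sum(x for x, _ in image_parts) / n_parts
        cy = sum(y for _, y in image_parts) / n_parts
        for p, (x, y) in enumerate(image_parts):
            offsets[p].append((x - cx, y - cy))
    # Spread of each part's offset over the training images.
    spread = [
        pvariance([dx for dx, _ in offs]) + pvariance([dy for _, dy in offs])
        for offs in offsets
    ]
    ranked = sorted(range(n_parts), key=lambda p: spread[p])
    return sorted(ranked[:num_parts])


# Hypothetical data: parts 0 and 1 keep their relative layout, part 2 jumps around.
part_locations = [
    [(10, 10), (20, 10), (0, 0)],
    [(15, 15), (25, 15), (50, 40)],
    [(12, 11), (22, 11), (30, 0)],
]
selected = select_consistent_parts(part_locations, num_parts=2)
print(selected)
```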

We use this part model for part-based image classification in
fine-grained and generic object recognition. In contrast to many
recent fine-grained works, our approach surpasses the
state-of-the-art in this area and is beneficial for other tasks
like data augmentation and generic object classification as well.
This is supported by, among other results, a recognition rate of
81.0% on CUB200-2011 without additional annotation and 84.1%
accuracy on Caltech 256.

In our future work, we plan to use the deep neural activation maps
directly as probability maps while maintaining the speed of our
current approach. The estimation of object scale would allow for
applying our approach to datasets in which objects only cover a
small part of the image. Our current limitation is the assumption
that a single channel corresponds to an object part. A combination
of channels can be considered to improve localization accuracy. In
addition, we plan to learn the constellation models and the
subsequent classification jointly in a common framework.

7. Changelog 

• V3: Added results for NA birds 

• V2: Updated to camera ready version 